Take-home Exercise 03

Author

Zhou Ao

Published

June 18, 2023

1. Overview

1.1 Background

FishEye International, a non-profit focused on countering illegal, unreported, and unregulated (IUU) fishing, has been given access to an international finance corporation’s database on fishing-related companies. In the past, FishEye has determined that companies with anomalous structures are far more likely to be involved in IUU (or other “fishy” business). FishEye has transformed the database into a knowledge graph. It includes information about companies, owners, workers, and financial status. FishEye is aiming to use this graph to identify anomalies that could indicate a company is involved in IUU.

FishEye analysts have attempted to use traditional node-link visualizations and standard graph analyses, but these were found to be ineffective because the scale and detail in the data can obscure a business’s true structure. Can you help FishEye develop a new visual analytics approach to better understand fishing business anomalies?

1.2 The Task

1.3 Reference

This exercise is based on Mini-Challenge 3 of VAST Challenge 2023.

2. The Data

2.1 Getting Started

Install (if necessary) and load the R packages needed for data preparation, data wrangling, data analysis and visualisation using the code chunk below.

pacman::p_load(jsonlite, tidygraph, ggraph, visNetwork, graphlayouts, ggforce, tidytext, tidyverse, skimr)

2.2 Data Import

The code chunk below imports the data into the R environment using fromJSON() from the jsonlite package.

mc3_data <- fromJSON("data/MC3.json")
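To see what fromJSON() returns before any wrangling, the sketch below parses a tiny hand-written JSON string shaped like a node-link file. The ids and field values are invented for illustration; only the nodes and links element names are taken from the code used in this exercise.

```r
library(jsonlite)

# Toy JSON with the two top-level elements used later (nodes, links).
# All ids and values are made up for illustration.
toy_json <- '{
  "nodes": [
    {"id": "A Co", "type": "Company", "country": "ZH"},
    {"id": "B Person", "type": "Beneficial Owner", "country": "ZH"}
  ],
  "links": [
    {"source": "A Co", "target": "B Person", "type": "Beneficial Owner"}
  ]
}'

# fromJSON() accepts a JSON string as well as a file path
toy_data <- fromJSON(toy_json)

# Arrays of records are simplified into data frames by default
str(toy_data, max.level = 1)
names(toy_data)
```

Running str() on the real mc3_data in the same way is a quick check that both elements arrived as data frames.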

2.3 Data Wrangling

2.3.1 The edges data

The code chunk below will be used to extract the links data.frame of mc3_data and save it as a tibble data.frame called mc3_edges.

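mc3_edges <- as_tibble(mc3_data$links) %>%
  distinct() %>%                              # drop exact duplicate links
  mutate(source = as.character(source),
         target = as.character(target),
         type = as.character(type)) %>%
  filter(source != target) %>%                # remove self-loops
  group_by(source, target, type) %>%
  # one row per source-target-type; because duplicates were dropped first,
  # weights is 1 for every row in this data set
  summarise(weights = n(), .groups = "drop")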
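The effect of calling distinct() before counting can be seen on a toy table. This is only an illustrative sketch with invented links, using count() as a shorthand for group_by() plus summarise(); it is not a change to the wrangling above.

```r
library(dplyr)

# Invented links: one exact duplicate (A-B) and one self-loop (B-B)
toy_links <- tibble(
  source = c("A", "A", "A", "B", "C"),
  target = c("B", "B", "C", "B", "A"),
  type   = c("Owner", "Owner", "Owner", "Owner", "Contact")
)

# Duplicates dropped first: every remaining group has exactly one row
with_distinct <- toy_links %>%
  distinct() %>%
  filter(source != target) %>%
  count(source, target, type, name = "weights")

# Duplicates kept: weights records how often each link was repeated
without_distinct <- toy_links %>%
  filter(source != target) %>%
  count(source, target, type, name = "weights")

with_distinct
without_distinct
```

In the first result every weight is 1, while in the second the repeated A-B link has a weight of 2.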

2.3.2 The nodes data

The code chunk below will be used to extract the nodes data.frame of mc3_data and save it as a tibble data.frame called mc3_nodes.

mc3_nodes <- as_tibble(mc3_data$nodes) %>%
  mutate(country = as.character(country),
         id = as.character(id),
         product_services = as.character(product_services),
         revenue_omu = as.numeric(as.character(revenue_omu)),
         type = as.character(type)) %>%
  select(id, country, type, revenue_omu, product_services)

Instead of using the nodes data table extracted from mc3_data directly, we will prepare a new nodes data table from the source and target fields of the mc3_edges data table. This ensures that the nodes data table includes every source and target value that appears in the edges.

id1 <- mc3_edges %>%
  select(source) %>%
  rename(id = source)

id2 <- mc3_edges %>%
  select(target) %>%
  rename(id = target)

mc3_nodes1 <- rbind(id1, id2) %>%
  distinct() %>%
  left_join(mc3_nodes, by = "id", unmatched = "drop")
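A small sketch of the same idea on invented data: node "C" appears only as an edge endpoint, so it is missing from the toy nodes table, and the left join keeps it with NA attributes. The toy_* names are hypothetical.

```r
library(dplyr)

toy_edges <- tibble(source = c("A", "A"), target = c("B", "C"))
toy_nodes <- tibble(id = c("A", "B"), country = c("ZH", "Oceanus"))

# Stack all edge endpoints, keep each id once, then attach known attributes
toy_nodes1 <- bind_rows(
  toy_edges %>% select(id = source),
  toy_edges %>% select(id = target)
) %>%
  distinct() %>%
  left_join(toy_nodes, by = "id")

toy_nodes1
```

Every endpoint of every edge now has a row, which is what a graph constructor needs in order to match edges to nodes.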

2.3.3 EDA

Exploring the edges data

Display the statistics summary of mc3_edges tibble data frame using skim() from skimr package per code chunk below.

skim(mc3_edges)
Data summary
Name mc3_edges
Number of rows 24036
Number of columns 4
_______________________
Column type frequency:
character 3
numeric 1
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
source 0 1 6 700 0 12856 0
target 0 1 6 28 0 21265 0
type 0 1 16 16 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
weights 0 1 1 0 1 1 1 1 1 ▁▁▇▁▁

The report above shows that there are no missing values in mc3_edges.

Display the mc3_edges tibble data frame as an interactive table on the html document using datatable() from DT package per code chunk below.

DT::datatable(mc3_edges)
ggplot(data = mc3_edges, aes(x = type)) +
  geom_bar()

Exploring the nodes data

Display the statistics summary of mc3_nodes tibble data frame using skim() from skimr package per code chunk below.

skim(mc3_nodes)
Data summary
Name mc3_nodes
Number of rows 27622
Number of columns 5
_______________________
Column type frequency:
character 4
numeric 1
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
id 0 1 6 64 0 22929 0
country 0 1 2 15 0 100 0
type 0 1 7 16 0 3 0
product_services 0 1 4 1737 0 3244 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
revenue_omu 21515 0.22 1822155 18184433 3652.23 7676.36 16210.68 48327.66 310612303 ▇▁▁▁▁

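The report above shows no missing values in the character fields, but revenue_omu is missing for 21,515 of the 27,622 rows (complete rate 0.22). The id field also has only 22,929 unique values, so some ids appear in more than one row.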
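The skim() report for mc3_nodes lists an n_missing of 21515 for revenue_omu and 22929 unique ids against 27622 rows. A minimal sketch of how such counts could be checked directly is given below, using an invented toy table in place of mc3_nodes.

```r
library(dplyr)

# Invented stand-in for mc3_nodes
toy_nodes <- tibble(
  id = c("A", "B", "B", "C"),
  revenue_omu = c(1000, NA, NA, 2500)
)

# Number of missing values in each column
na_counts <- toy_nodes %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

# Number of ids that occur in more than one row
n_dup_ids <- toy_nodes %>%
  count(id) %>%
  filter(n > 1) %>%
  nrow()

na_counts
n_dup_ids
```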

Display the mc3_nodes tibble data frame as an interactive table on the html document using datatable() from DT package per code chunk below.

DT::datatable(mc3_nodes)
ggplot(data = mc3_nodes, aes(x = type)) +
  geom_bar()

3. Visualisation and Analysis

3.1 Initial Network Visualisation

3.1.1 Network model with tidygraph

mc3_graph <- tbl_graph(nodes = mc3_nodes1,
                       edges = mc3_edges,
                       directed = FALSE) %>%
  mutate(betweenness_centrality = centrality_betweenness(),
         closeness_centrality = centrality_closeness())
mc3_graph %>%
  filter(betweenness_centrality >= 100000) %>%
  ggraph(layout = "fr") +
  # constant alpha and colour are set outside aes() so they are not treated as data
  geom_edge_link(alpha = 0.5) +
  geom_node_point(aes(size = betweenness_centrality),
                  colour = "lightblue", alpha = 0.5) +
  scale_size_continuous(range = c(1, 10)) +
  theme_graph()
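To make the two centrality measures concrete, the sketch below computes them on an invented four-node path graph (A-B-C-D). The node column is called name here so that the character edge endpoints are matched unambiguously; that naming is a choice made for this toy example.

```r
library(tidygraph)

# Invented path graph: A - B - C - D
toy_graph <- tbl_graph(
  nodes = data.frame(name = c("A", "B", "C", "D")),
  edges = data.frame(from = c("A", "B", "C"), to = c("B", "C", "D")),
  directed = FALSE
) %>%
  mutate(betweenness_centrality = centrality_betweenness(),
         closeness_centrality = centrality_closeness())

# Pull the node table out as a tibble to inspect the scores
toy_scores <- toy_graph %>%
  activate(nodes) %>%
  as_tibble()

toy_scores
```

The inner nodes B and C sit on the shortest paths between the others, so they receive the higher betweenness, and they are also closer on average to the rest of the graph than the end nodes A and D.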

3.1.2 Text sensing with tidytext

Simple word count

Count the number of times the word “fish” appears in the product_services field using the code chunk below.

mc3_nodes %>%
  mutate(n_fish = str_count(product_services, "fish")) %>%
  arrange(desc(n_fish))
# A tibble: 27,622 × 6
   id                          country type  revenue_omu product_services n_fish
   <chr>                       <chr>   <chr>       <dbl> <chr>             <int>
 1 Taylor LLC                  ZH      Comp…     138982. Fish (anchovy, …     11
 2 Gvardeysk Sextant ОАО Cargo Uziland Comp…      73027. Fish salads (It…     11
 3 Mclaughlin, Valdez and Moo… ZH      Comp…       6154. Fresh, cured, o…      8
 4 Murphy, Fisher and Barnes   ZH      Comp…     256739. Frozen fish blo…      7
 5 Garcia, Lloyd and Houston   ZH      Comp…      53509. Bottom fishes; …      7
 6 SeaSelect Foods Salt spray  Marebak Comp…      41902. European whole …      7
 7 Ancla del Este Sagl         Oceanus Comp…      16167. Bottom fishes, …      7
 8 Monroe, Smith and Miller    ZH      Comp…     250255. Frozen processe…      6
 9 Arunachal Pradesh s S.A. d… Marebak Comp…      60346. Offers a wide r…      6
10 suō yú Ltd. Liability Co    Coralm… Comp…      31567. Offers a wide r…      6
# ℹ 27,612 more rows
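Note that str_count() matches patterns case-sensitively by default, so a capitalised "Fish" is not counted by the pattern "fish". The sketch below, on invented strings, contrasts the default with a case-insensitive pattern built with regex(); whether case-insensitive counting is wanted here is a judgment call.

```r
library(stringr)

# Invented example strings
toy_text <- c("Fish and fish products", "FISHING gear", "Canned tuna")

# Default: case-sensitive matching
n_sensitive <- str_count(toy_text, "fish")

# regex(ignore_case = TRUE) also counts "Fish" and "FISH"
n_insensitive <- str_count(toy_text, regex("fish", ignore_case = TRUE))

n_sensitive
n_insensitive
```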

Tokenisation

The code chunk below uses unnest_tokens() from the tidytext package to split the text in the product_services field into words.

token_nodes <- mc3_nodes %>%
  unnest_tokens(word, product_services)
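As a small illustration of what unnest_tokens() does, the sketch below tokenises two invented product descriptions. By default the function lowercases the text and strips punctuation, and it returns one row per word while keeping the other columns.

```r
library(dplyr)
library(tidytext)

# Invented stand-in for mc3_nodes
toy_nodes <- tibble(
  id = c("A Co", "B Co"),
  product_services = c("Frozen fish, fish fillets", "Canned Tuna")
)

# One row per word; the id column is repeated for each token
toy_tokens <- toy_nodes %>%
  unnest_tokens(word, product_services)

toy_tokens
```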

Now we can visualise the words extracted by using the code chunk below.

token_nodes %>%
  count(word, sort = TRUE) %>%
  top_n(15, n) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  coord_flip() +
  # after coord_flip() the x aesthetic (word) is drawn on the vertical axis
  labs(x = "Unique words", y = "Count", title = "Count of unique words found in product_services field")

Notice that the top 15 most frequent words contain a few stop words, e.g. “and” and “of”.

Removing stopwords

We will use the stop_words data set from the tidytext package to remove stop words.

stopwords_removed <- token_nodes %>%
  anti_join(stop_words, by = "word")
Note
  • The anti_join() from dplyr package is used to remove all stop words.
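The anti_join() step can be checked on a handful of invented tokens: rows whose word appears in stop_words are dropped and all other rows are kept.

```r
library(dplyr)
library(tidytext)

# Invented tokens, including two common stop words
toy_tokens <- tibble(word = c("fish", "and", "seafood", "of", "fish"))

# Keep only rows whose word has no match in the stop_words data set
toy_clean <- toy_tokens %>%
  anti_join(stop_words, by = "word")

toy_clean
```

Passing by = "word" makes the join key explicit and avoids the message dplyr prints when it has to guess the common column.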

Then we can visualise the words extracted using the code chunk below.

stopwords_removed %>%
  count(word, sort = TRUE) %>%
  top_n(15, n) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  coord_flip() +
  # after coord_flip() the x aesthetic (word) is drawn on the vertical axis
  labs(x = "Unique words", y = "Count", title = "Count of unique words found in product_services field")